Abstract

In this work we report on our experiences running OpenMP programs on a commodity cluster
of PCs running a software distributed shared memory (DSM) system. We describe our test
environment and report on the performance of a subset of the NAS Parallel Benchmarks that have
been automatically parallelized for OpenMP. We compare the performance of the OpenMP
implementations with that of their message passing counterparts and discuss performance differences.


1 Introduction 

Computer architectures using clusters of PCs with commodity networking have become a low cost
alternative for high end scientific computing. Currently message passing is the dominant
programming model for such clusters. The development of a parallel program based on message passing
adds a new level of complexity to the software engineering process since not only computation, but
also the explicit movement of data between the processes must be specified.

Shared memory parallel processors (SMP) provide a more user friendly programming model. The use
of globally addressable memory allows users to exploit parallelism while avoiding the difficulties of
explicit data distribution on parallel machines. Parallelism is commonly achieved by multi-threading
the execution of loops. Compiler directives for multithreaded execution of loops are supported
on most shared memory parallel platforms. In addition, many compilers provide an automatic
parallelization feature that takes the entire burden of code analysis off the user. The efficiency of
compiler parallelized code is often limited, since a thorough dependence analysis is not possible
without user information. Alternatively, there are parallelization support tools available which take
the tedious work of dependence analysis and generation of directives off the user but allow user
guidance for critical parts of the code. An example of such a tool is CAPO [10].

While shared memory architectures provide a convenient programming model for the user, their
drawback is that they are expensive and the scalability of the code may be limited due to poor data
locality and possibly large synchronization overhead. In recent years there have been considerable
efforts to develop system software to support DSM (Distributed Shared Memory) programming,
which enables the user to employ the convenient shared memory programming model on a network
of processors, thereby maintaining the ease of use while keeping the low cost of hardware.
Examples of such systems are TreadMarks [2] and SCASH [13]. These systems allow the support of
OpenMP parallelization on clusters of processors, thereby removing the major impediment to their
usage, which is the high effort required to develop a message passing version from a sequential
program. We have installed publicly available DSM software on a commodity cluster of PCs and tested
its performance on a set of benchmark kernels. The paper seeks to evaluate the efficiency
of DSM without explicit hardware support. The rest of the paper is structured as follows: In section 2
we discuss the message passing and the shared address space programming models. In section 3 we
describe the hardware platform and system software of our test environment. In section 4 we describe
our evaluation strategy and discuss the performance of the individual benchmark kernels. In section 5
we discuss some of the problems we encountered. In section 6 we briefly examine some related work
and in section 7 we summarize our conclusions and discuss future work.


2 Programming Models 

Currently message passing and shared address space are the two leading programming models for
clusters of SMPs.

2.1 Message Passing 

Message passing is a well understood programming paradigm. The computational work and the
associated data are distributed between a number of processes. If a process needs to access data located
in the memory of another process, it has to be communicated via the exchange of messages. The
data transfer requires cooperative operations to be performed by each process, that is, every send
must have a matching receive. The regular message passing communication achieves two effects:
communication of data from sender to receiver and synchronization of sender with receiver.

MPI (Message Passing Interface) [12] is a widely accepted standard for writing message passing
programs. It is a standard programming interface for the construction of portable, parallel
applications in Fortran or in C/C++, and is commonly used when the application can be decomposed
into a fixed number of processes operating in a fixed topology (for example, a pipeline, grid, or tree).
MPI provides the user with a programming model where processes communicate by calling library
routines to send and receive messages. Pairs of processes can perform point-to-point communication
to exchange messages. For increased convenience and performance a group of processes can also
call collective communication routines to implement global operations such as broadcasting values
or calculating global sums. Global synchronization can be implemented by calls to barrier routines.
Asynchronous communication is supported by providing calls for probing and waiting for certain
messages. In MPI-1, all communication operations require the sending as well as the receiving side
to issue calls to the message passing library.
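The two effects of message passing described above, data transfer and synchronization, can be illustrated with a small sketch. It is written in Python, with threads and queues standing in for processes and the message passing library; the names `global_sum`, `send`, and `recv` are hypothetical and are not MPI calls. Each simulated rank sends its partial sum to rank 0, which posts one receive per expected send and then returns the global sum to every rank. Real MPI libraries typically use more elaborate algorithms for collective operations.

```python
import queue
import threading

def global_sum(partials):
    """Simulate a global sum over len(partials) ranks with explicit messages."""
    nranks = len(partials)
    # One mailbox per rank; a blocking get() plays the role of a receive.
    mailbox = [queue.Queue() for _ in range(nranks)]
    results = [None] * nranks

    def send(dest, value):
        mailbox[dest].put(value)

    def recv(rank):
        return mailbox[rank].get()  # blocks until a send to this rank arrives

    def rank_main(rank):
        if rank == 0:
            total = partials[0]
            for _ in range(nranks - 1):      # one receive per expected send
                total += recv(0)
            for dest in range(1, nranks):    # return the result to every rank
                send(dest, total)
            results[0] = total
        else:
            send(0, partials[rank])
            results[rank] = recv(rank)       # waiting here also synchronizes

    threads = [threading.Thread(target=rank_main, args=(r,)) for r in range(nranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

print(global_sum([1.0, 2.0, 3.0, 4.0]))  # [10.0, 10.0, 10.0, 10.0]
```

Note that no rank can leave the operation before rank 0 has received all contributions, which is the synchronization side effect mentioned in section 2.1.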

2.2 Shared Address Space 

Parallel programming on a shared memory machine can take advantage of the globally shared address
space. Compilers for shared memory architectures usually support multi-threaded execution of a
program. Loop level parallelism can be exploited by using compiler directives such as those defined
in the OpenMP standard [14]. Multiple execution threads are automatically created for performing
the work in parallel. Data transfer between threads is done by direct memory references. OpenMP
provides a fork/join execution model in which a program begins execution as a single process or
thread. This thread executes sequentially until a PARALLEL construct is found. At this time, the
thread creates a team of threads and it becomes its master thread. All threads execute the statements
lexically enclosed by the parallel construct. Work-sharing constructs (DO, SECTIONS and SINGLE)
are provided to divide the execution of the enclosed code region among the members of a team. All
threads are independent and may synchronize at the end of each work-sharing construct or at specific
points either implicitly or explicitly (specified by the BARRIER directive). Exclusive execution mode
is also possible through the definition of CRITICAL regions.
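The fork/join model with a work-sharing loop can be sketched as follows. This is a minimal Python illustration of the structure only, assuming a static block schedule; `static_chunks` and `parallel_do` are hypothetical helper names, not part of OpenMP or of any OpenMP runtime, and Python threads will not give the speedup a real OpenMP team would.

```python
from concurrent.futures import ThreadPoolExecutor

def static_chunks(n, nthreads):
    """Split iterations 0..n-1 into contiguous blocks, one per thread."""
    base, extra = divmod(n, nthreads)
    chunks, start = [], 0
    for t in range(nthreads):
        size = base + (1 if t < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks

def parallel_do(body, n, nthreads):
    """Fork a team, run body(i) for every iteration, then join."""
    def worker(chunk):
        for i in chunk:
            body(i)
    with ThreadPoolExecutor(max_workers=nthreads) as team:   # fork
        list(team.map(worker, static_chunks(n, nthreads)))
    # Leaving the with-block waits for all workers: the implicit barrier (join).

# Sequential part: only the master thread runs here.
a = [0] * 10

def body(i):
    a[i] = i * i          # each iteration writes a distinct element

parallel_do(body, len(a), nthreads=4)
print(a)                  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

The shared list `a` plays the role of data in the global address space: the workers exchange results through direct memory references rather than messages.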

This approach provides a relatively easy way to develop parallel programs but has disadvantages.
It is often difficult to achieve scalability of the code for a large number of processors due to a lack of
data locality and excessive synchronization costs.


3 Hardware Platform and Software Description 

Our test environment consists of a cluster of commodity PCs at the High Performance Computing 
Center of the University of Stuttgart (HLRS). In the following we give some details about hardware 
and system software. 


3.1 Platform description

We have used a cluster at HLRS consisting of 8 NEC 120Ed server nodes as the test platform. The
nodes are dual processor systems with two 1 GHz Pentium III processors and 2 GB of main memory. Each
node is equipped with a Myrinet 2000 NIC in a fast 64 bit / 66 MHz PCI slot. The nodes are based
on the ServerSet III HE chipset and offer good communication performance to the Myrinet cards.
The bandwidth from memory to the card is 409 MB/s for read operations and 480 MB/s for write
operations. These data have been acquired with the program 'gm.debug' provided by Myricom. A
collection of data for other motherboards and chipsets can be found at [1]. For our evaluation we used
only one CPU per node.

In order to compare the performance of SCASH with a true shared memory system, we used a
16-way NEC AzusA. The AzusA is a shared memory system with IA-64 processors. Both systems,
the distributed memory cluster and the shared memory AzusA, were running Linux in its 2.4
version. This reduces effects due to different memory management in different operating systems on
the distributed and the shared memory architecture. The performance impact of different memory
management systems is discussed in [5].

We did not have a four or eight processor IA-32 system available for the tests.


3.2 SCore 

SCore is a parallel programming environment for workstations and PC clusters, developed by the
Real World Computing Partnership (RWCP). The project has now been transferred to the PC Cluster
Consortium. Amongst other features, SCore provides its own communication layer called PM [19,
20]. It aims at providing a uniform interface to different communication devices like Fast Ethernet,
Gigabit Ethernet or Myrinet.

SCore also supports different parallel programming paradigms like message passing or shared
memory. On the message passing side there is an MPI implementation based on MPICH with an
additional device specifically designed for the PM layer. Shared memory is supported in two ways. The
PM layer has a shared memory device that is intended for SMP systems. It uses memory-mapped
shared segments for the communication between processes on a true shared memory system.
Additionally, the SCore architecture has a software distributed shared memory system called SCASH [6],
which we employed to obtain the results of the tests we present in this paper.

3.3 SCASH 

SCASH [6] is a page-based software distributed shared memory system. It is implemented as a user- 
space runtime library which uses the PM layer for communicating pages between cluster nodes. 

It employs an eager release consistency model to ensure the consistency of shared memory on a
per-page basis. This means that at memory synchronisation points only modified parts of memory are
updated, which usually requires exchange of data between nodes.

The home node of a page is the node that keeps the latest data of the page. If other nodes change
the data within a page it must be updated on the home node. To reduce memory transfer, SCASH also
provides the possibility to change the home node of a page. It is possible to use two page consistency
protocols, an invalidate and an update protocol, which can be chosen dynamically.

To reduce memory transfer between nodes, the nodes use cached copies of requested pages. Only 
on write operations to the memory can these copies become inconsistent. The update protocol speci- 
fies that all copies of a particular page be updated once one node changes its contents. 

In the invalidate protocol, the home node of a page notifies all nodes which share that page when
a page has been altered, and cached copies of that page on other nodes become invalid.
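The interplay of home nodes, cached copies, and invalidation can be made concrete with a toy model. The sketch below is an illustration in Python under simplifying assumptions and is not SCASH code: the class `ToyDSM` and its methods are hypothetical, pages are assigned to home nodes round-robin, only page requests are counted (the messages that push modifications to the home node are not), and a page is treated as an indivisible unit.

```python
class ToyDSM:
    """Toy page-based DSM with home nodes and an invalidate protocol."""

    def __init__(self, nnodes, npages):
        self.nnodes = nnodes
        # Assumed placement: pages are assigned to home nodes round-robin.
        self.home = [p % nnodes for p in range(npages)]
        # valid[n] is the set of pages for which node n holds an up-to-date copy.
        self.valid = [{p for p in range(npages) if self.home[p] == n}
                      for n in range(nnodes)]
        self.dirty = [set() for _ in range(nnodes)]
        self.page_requests = 0

    def _fetch(self, node, page):
        if page not in self.valid[node]:
            self.page_requests += 1          # copy the page from its home node
            self.valid[node].add(page)

    def read(self, node, page):
        self._fetch(node, page)

    def write(self, node, page):
        self._fetch(node, page)
        self.dirty[node].add(page)

    def barrier(self):
        """Synchronization point: changes reach the home node, other copies are invalidated."""
        writers = {}
        for node in range(self.nnodes):
            for page in self.dirty[node]:
                writers.setdefault(page, set()).add(node)
            self.dirty[node].clear()
        for page, nodes in writers.items():
            keep = {self.home[page]}
            if len(nodes) == 1:
                keep |= nodes                # a sole writer already has the latest data
            for other in range(self.nnodes):
                if other not in keep:
                    self.valid[other].discard(page)

dsm = ToyDSM(nnodes=2, npages=4)
dsm.read(0, 1)       # page 1 lives on node 1: first page request
dsm.read(0, 1)       # served from the cached copy
dsm.write(1, 1)      # the home node modifies its own page
dsm.barrier()        # node 0's cached copy becomes invalid
dsm.read(0, 1)       # second page request
print(dsm.page_requests)   # 2
```

Even in this simplified model a single write to a shared page forces every other reader to fetch the whole page again, which is why the number of page requests is a useful indicator of communication volume.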

3.4 Omni OpenMP 

Omni OpenMP is a collection of programs and libraries that enable OpenMP for back-end compilers 
that do not support it natively. The front-end to these compilers translates C or Fortran77 OpenMP 
source texts into multi-threaded C with calls to a runtime library. 

One of the main goals of Omni OpenMP is portability, so the translation pass from an OpenMP
program to the target code is written in Java. The target code is in turn compiled by the back-end
C compiler on the target platform. For the tests presented here we used the GNU C Compiler as the
back-end compiler.

The Omni compiler suite can be configured to use several different underlying libraries. For
the thread system, Solaris Threads or pthreads are supported, but there is also support for
StackThreads [18] developed by the Real World Computing Partnership (RWCP). In addition to the support of
threads there is support for several shared memory implementations, like UNIX shmem. In our tests
we used the support for the SCASH distributed shared memory system which has been described
above.

The Omni OpenMP compiler suite is also available for IA-64. For tests on the shared memory
AzusA system (see 3.1) we used the Omni compiler, too, again in order to minimize the influence of
different software. This way we can attribute certain observations to either the DSM system or the
Omni OpenMP compiler.

4 Case studies 

For our evaluation we selected a subset of the NAS Parallel Benchmarks [3]. They were designed to
compare the performance of parallel computers for computational fluid dynamics (CFD) applications.
The full suite consists of five benchmark kernels and three simulated CFD applications. We selected
three of the five benchmark kernels for our study.

4.1 Evaluation Strategy 

To evaluate the performance of our test environment we compare the timings of OpenMP
implementations of the benchmark kernels to:

1. Timings of their message passing counterparts on the same system.

2. Timings obtained on a true shared memory system but with the same operating system and
therefore a comparable memory management system.

The first comparison, OpenMP versus MPI, gives us some means to determine how well the DSM
software handles memory coherency and synchronization. In the MPI implementation access to
remote data is achieved by calls to the message passing library. The user has control over data locality
and decides when and how much data to communicate. This provides the opportunity to minimize
communication during program execution. Another aspect of the message passing approach is that
data communication and synchronization are integrated. The send and receive operations not only
exchange data, but also regulate the progress of the processes. In the OpenMP implementation the
locations of the data, the amount of data to be communicated, and the synchronization among the
threads depend on the DSM system and the compiler. As explained in section 3, the DSM system
detects the necessity of communicating data when a page of memory is accessed that has been marked as
updated by another process. We will use the number of page requests as an indicator for the amount
of communication in the DSM system. Even in the case where a hand-optimized message passing
implementation outperforms the DSM system, the ease of application porting may compensate for a
certain loss of performance.

The comparison of OpenMP on a cluster versus OpenMP on a shared memory node gives us some
estimate of the speedup that can be expected from the OpenMP programming paradigm on a true
shared memory architecture. Our test platforms are described in section 3. We use the Omni compiler
on both platforms.

The benchmarks come in different classes determined by the problem size. We ran only the small
problem classes S, W, and A, since we encountered some problems with the larger sizes, which are
discussed in section 5. Since our system is small, consisting of only 8 nodes, it is hazardous to
extrapolate the scalability studies to larger systems. However, running the very small benchmark
classes allows us to gain some insight into how the computation to communication ratio impacts the
performance.
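The quantities compared throughout this section can be written down explicitly. The sketch below is a small Python illustration; the function names are hypothetical and the timings are invented values chosen only to exercise the functions, not measurements from our tests.

```python
def speedup(t_serial, t_parallel):
    """Speedup of a parallel run relative to the serial run."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, nprocs):
    """Parallel efficiency: speedup divided by the number of processes."""
    return speedup(t_serial, t_parallel) / nprocs

def relative_to_mpi(t_dsm, t_mpi):
    """OpenMP/DSM speedup as a fraction of the MPI speedup.

    Both speedups share the same serial time, so the ratio reduces to
    the ratio of the two parallel timings.
    """
    return t_mpi / t_dsm

# Hypothetical timings in seconds for a run on 8 nodes.
t1, t_mpi8, t_dsm8 = 80.0, 12.5, 16.0
print(speedup(t1, t_mpi8))             # 6.4
print(efficiency(t1, t_dsm8, 8))       # 0.625
print(relative_to_mpi(t_dsm8, t_mpi8)) # 0.78125
```

When we state that the OpenMP/DSM version reaches a certain percentage of MPI, we mean the last of these ratios.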

Since the ease of application porting is an important factor in favor of the DSM system, we started
out with a sequential version of our benchmark kernels and used the automatic parallelization support
tool CAPO [10] to insert OpenMP directives, thereby minimizing the parallelization effort. CAPO
was developed at the NASA Ames Research Center. It takes as input a sequential Fortran program.
It then performs an extensive dependence analysis over statements, loop iterations, and subroutine
calls and generates Fortran code containing OpenMP directives. CAPO is based on the dependence
analysis module of the CAPTools [8] parallelization tool. Our starting point for the message
passing version of the benchmark kernels was the NPB2.3 [4] release of the NAS Parallel Benchmarks.
For the OpenMP implementations we started with an optimized serial implementation of the same
benchmarks as described in [9]. The structure of the serial code is kept very close to the message
passing code. Only slight modifications were applied to the kernels considered in our study and we
will describe them in the sections below. A good description of how to use CAPO for the OpenMP
parallelization of the benchmarks is given in [10].

Figure 1: Speedup for class A of the EP benchmark for OpenMP/DSM and MPI

4.2 The EP benchmark kernel 

EP stands for embarrassingly parallel. The kernel generates pairs of Gaussian random deviates
according to a specific scheme. As the name suggests, the iterations of the main loop can be executed
in parallel. Tool based OpenMP parallelization of the kernel was possible without user interaction.
Once the data is distributed, the main loop which generates the Gaussian pairs and tallies the counts
does not require access to remote data except for several global sum reductions at the end. In the MPI
implementation the global sum is achieved by calls to mpi_allreduce. The OpenMP
implementation uses the OMP PARALLEL REDUCTION directive. The MPI implementation shows a very low
communication overhead, which is less than 1% even for the smallest benchmark class on 8 nodes.
If m denotes the log2 of the number of complex pairs of uniform (0, 1) random numbers, then the
problem size of the benchmark classes under consideration is:

Class S: m = 24 

Class W: m = 25 

Class A: m = 28 
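The structure of the EP computation, independent chunks followed by a few global sums, can be sketched as follows. This is a Python illustration under stated assumptions rather than the benchmark code: it uses the polar (acceptance-rejection) method to turn uniform pairs into Gaussian pairs, Python's `random` module instead of the benchmark's own random number generator, a hypothetical per-worker seeding scheme, and a much smaller m than the benchmark classes so that it runs quickly. The names `ep_chunk` and `ep` are hypothetical.

```python
import math
import random

def ep_chunk(npairs, seed):
    """Generate Gaussian pairs from npairs uniform pairs and tally them."""
    rng = random.Random(seed)       # stand-in generator, not the benchmark's
    counts = [0] * 10               # counts[l]: pairs with l <= max(|X|,|Y|) < l+1
    sx = sy = 0.0
    for _ in range(npairs):
        x = 2.0 * rng.random() - 1.0
        y = 2.0 * rng.random() - 1.0
        t = x * x + y * y
        if 0.0 < t <= 1.0:          # accept only points inside the unit circle
            f = math.sqrt(-2.0 * math.log(t) / t)
            gx, gy = x * f, y * f
            l = min(int(max(abs(gx), abs(gy))), len(counts) - 1)  # clamp to last bin
            counts[l] += 1
            sx += gx
            sy += gy
    return counts, sx, sy

def ep(m, nworkers):
    """Process 2**m uniform pairs in nworkers independent chunks."""
    per_worker = 2 ** m // nworkers         # assumes nworkers divides 2**m
    total = [0] * 10
    sx = sy = 0.0
    for w in range(nworkers):               # these iterations are independent
        c, px, py = ep_chunk(per_worker, seed=w)
        total = [a + b for a, b in zip(total, c)]   # global sum reductions
        sx += px
        sy += py
    return total, sx, sy

counts, sx, sy = ep(m=14, nworkers=8)
print(counts, sx, sy)
```

The only data that has to be combined across workers are the ten counters and the two sums, which is why the communication overhead of this kernel is so small.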

The OpenMP/DSM implementation shows a very low number of page requests to the DSM
system. As expected, the message passing as well as the OpenMP/DSM implementation show an almost
linear speedup for all benchmark classes. For 8 nodes the OpenMP/DSM performance ranges within
97% to 102% of that of MPI, depending on the benchmark class. As an example we show the
speedup for class A in fig. 1.

4.3 The CG benchmark kernel 

The CG benchmark kernel uses a conjugate gradient method to compute an approximation to the
smallest eigenvalue of a large, sparse, unstructured matrix. The kernel is useful for testing
unstructured grid computations and communications since the underlying matrix has randomly generated
locations of entries. Parallelization for the message passing and the directive based versions occurs on
the same level within the conjugate gradient algorithm. The basic parallel operations are: sparse matrix
vector multiply, AXPY operations, and sum reductions. The code was parallelized using CAPO
without any user interaction. If na denotes the number of rows of the sparse matrix and nz the number of
non-zero elements per row, then the problem size of the benchmark classes under consideration is:

Class S: na = 1400 nz = 7

Class W: na = 7000 nz = 8

Class A: na = 14000 nz = 11

In fig. 2 we show the speedup for the three benchmark classes. For class A, the MPI as well as the
OpenMP/DSM and OpenMP/SMP implementations show reasonable speedup. The OpenMP/SMP
version shows occasional superlinear speedup due to cache effects. For 8 nodes, the OpenMP/DSM
efficiency reaches about 75% of that of MPI. The MPI version maintains this speedup for the smaller
problem sizes, but the performance of the OpenMP/DSM version decreases drastically. For 8 nodes
and class W the OpenMP/DSM efficiency is only 35%, and for class S it goes down to 6%, yielding a
speedup of less than 1.

The class S problem size is far too small to serve as a realistic example. However, we take
a closer look at the performance differences for this class to get an idea of potential scalability
issues related to the DSM system.

Our first observation is that the Omni compiler and its runtime library introduce additional
overhead which decreases performance even on a shared memory system. This is demonstrated in fig. 2d,
where we compare the speedup of class S for the Omni compiler with that of the Intel compiler and
Guide, which is part of the KAP/Pro ToolSet of Kuck Associates/Intel.

To analyse the DSM performance we examine the three major time consuming loops within one
conjugate gradient iteration. These loops are the same in the MPI and the OpenMP/DSM
implementation. They implement a sparse matrix-vector multiplication (MVM), a dot-product (DOT), and a
loop combining two AXPY operations and a dot-product. Code examples are shown in fig. 3.

The sparse matrix A is stored in packed format such that indirect addressing is required for matrix
operations. The sparse matrix-vector multiply is a double-nested loop requiring indirect addressing.
For OpenMP, it is parallelized by using an OMP PARALLEL DO on the outer loop across the rows
of the sparse matrix. The dot-product as well as the AXPYs combined with a dot-product are single
loop nests, using the OpenMP REDUCTION clause to build the global sum.
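The packed storage and the row-wise work distribution can be illustrated with a small sketch. It is written in Python with 0-based indices and reuses the array names `rowstr`, `colidx`, `a`, `p`, and `q` from the benchmark code; the functions `matvec_rows`, `matvec`, and `dot`, the block distribution of rows, and the 3x3 example matrix are assumptions made for illustration. The "threads" are executed one after another, so the sketch shows the decomposition, not the speedup.

```python
def block(n, nthreads, t):
    """Index range owned by thread t under a block distribution of n items."""
    return range(t * n // nthreads, (t + 1) * n // nthreads)

def matvec_rows(rowstr, colidx, a, p, rows):
    """Compute q_j for the given rows of a matrix stored row by row."""
    out = []
    for j in rows:
        s = 0.0
        for k in range(rowstr[j], rowstr[j + 1]):
            s += a[k] * p[colidx[k]]      # indirect addressing through colidx
        out.append((j, s))
    return out

def matvec(rowstr, colidx, a, p, nthreads=2):
    """q = A*p with the outer loop over rows split among nthreads."""
    nrows = len(rowstr) - 1
    q = [0.0] * nrows
    for t in range(nthreads):             # each block could run on its own thread
        for j, s in matvec_rows(rowstr, colidx, a, p, block(nrows, nthreads, t)):
            q[j] = s
    return q

def dot(p, q, nthreads=2):
    """Dot-product as per-thread partial sums followed by a reduction."""
    partial = [sum(p[j] * q[j] for j in block(len(p), nthreads, t))
               for t in range(nthreads)]
    return sum(partial)                   # the global sum

# A = [[4, 0, 1], [0, 3, 0], [2, 0, 5]] in packed row storage.
rowstr = [0, 2, 3, 5]
colidx = [0, 2, 1, 0, 2]
a = [4.0, 1.0, 3.0, 2.0, 5.0]
p = [1.0, 2.0, 3.0]

q = matvec(rowstr, colidx, a, p)
print(q)           # [7.0, 6.0, 17.0]
print(dot(p, q))   # 70.0
```

The matrix-vector product needs no reduction because every row writes its own element of `q`, whereas the dot-product ends in a global sum that all threads must contribute to.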

The speedups for class S for the three major loops are shown in fig. 4. Both implementations
suffer from a large communication to computation ratio for the single nested loops. However, the
effect is far more severe for the DSM system. In the MPI version the communication required for the
global reduction operations is highly optimized by using non-blocking send and receive to minimize
synchronization overhead. The set of processes that communicate with each other is determined in
advance. This allows the amount of communication within the iteration loop to be reduced. In
the OpenMP/DSM implementation, processing the OpenMP REDUCTION clause by the DSM system
generates a large communication overhead, which is indicated by a high number of page requests and
manifests itself in poor speedup, as can be seen in fig. 4. The parallel efficiency is bad for the
matrix-vector multiply and disastrous for the dot-product and AXPY operations.
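Why short loops with a reduction suffer so much can be seen from a simple cost model. The sketch below is a toy model in Python, not a fit to our measurements: it assumes that every parallel execution of a loop pays a fixed synchronization cost `t_sync` on top of the evenly divided work, and the constants used in the example are arbitrary units invented for illustration. The function name `modeled_speedup` is hypothetical.

```python
def modeled_speedup(n, nprocs, t_iter, t_sync):
    """Speedup of an n-iteration loop when each parallel run pays a fixed t_sync."""
    t_serial = n * t_iter
    if nprocs == 1:
        return 1.0                      # the serial run pays no synchronization
    t_parallel = t_serial / nprocs + t_sync
    return t_serial / t_parallel

# Invented constants: one time unit per iteration, 10000 units per reduction.
for n in (1400, 14000, 1400000):
    print(n, round(modeled_speedup(n, 8, t_iter=1.0, t_sync=10000.0), 2))
```

In this model the speedup exceeds 1 only when n * t_iter * (1 - 1/nprocs) is larger than t_sync, so a loop over a short vector can run slower in parallel than serially, while a sufficiently long loop approaches the ideal speedup.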

We conclude that the performance loss for the small size problems is due to:

1. Additional overhead due to the Omni compiler.

2. A high communication to computation ratio which results from short loops and global
communication operations.

For the more realistic benchmark class A the performance of the DSM system is acceptable.

Figure 2: Speedups for different classes of the CG benchmark. In (a) the speedup for OpenMP/DSM
is shown for classes A, W and S. The MPI speedup for the same classes is given in (b). The speedup
for a true shared memory system is presented in (c). (d) shows a comparison of the speedup for class S
for different compilers on a shared memory platform. The Guide and the Intel compiler both support
OpenMP natively.

Matrix-Vector Product:

!$omp parallel do default(shared) private(j, k, sum)
      do j = 1, lastrow-firstrow+1
         sum = 0.d0
         do k = rowstr(j), rowstr(j+1)-1
            sum = sum + a(k)*p(colidx(k))
         enddo
         q(j) = sum
      enddo

Dot-Product:

      d = 0.0d0
!$omp parallel do default(shared) private(j) reduction(+:d)
      do j = 1, lastcol-firstcol+1
         d = d + p(j)*q(j)
      enddo

AXPY/Dot-Product Combination:

      rho = 0.0d0
!$omp parallel do default(shared) private(j) reduction(+:rho)
      do j = 1, lastcol-firstcol+1
         z(j) = z(j) + alpha*p(j)
         r(j) = r(j) - alpha*q(j)
         rho = rho + r(j)*r(j)
      enddo

Figure 3: Code examples for the matrix-vector multiplication, a dot-product, and a loop combining two
AXPY operations and a dot-product

Figure 4: Details of the CG benchmark's class S. Speedups are shown for the matrix-vector multiplication
(MVM), for the dot-product (DOT) and AXPY+dot-product (AXPY+DOT). (a) results for DSM, (b)
results for MPI


4.4 The FT benchmark kernel

The FT benchmark is the computational kernel of a spectral method based on a 3-D Fast Fourier
Transform (FFT). During the setup phase the 3-D array is filled with random numbers. Unlike in the
other benchmarks, the setup phase is part of the timed code. The serial implementation of the FT code
was changed to pre-calculate the values for the loop that initializes each data plane. This enables the
directive based parallelization of the loop. The main loop in FT could not be parallelized completely
automatically. Due to the complicated structure of the loop CAPO had to assume data dependencies
that prevented parallelization. In contrast to a compiler, CAPO allows interactive user guidance during
the parallelization process. Parallelization could be achieved by privatizing certain arrays through the
CAPO user interface.

If nx, ny, and nz denote the number of gridpoints in each of the spatial dimensions, the sizes of
the benchmark classes under consideration are given as:

Class S: nx= 64, ny= 64, nz= 64

Class W: nx=128, ny=128, nz= 32

Class A: nx=256, ny=256, nz=128

The speedup for the OpenMP/DSM, MPI, and OpenMP/SMP versions for our three benchmark classes
is shown in fig. 5. For 8 nodes the OpenMP/DSM implementation achieves about 70% of the MPI
speedup for class A, 65% for class W, and 50% for class S. The OpenMP/DSM speedup is limited to
about 4 out of 8 processes compared to 6 out of 8 for the MPI implementation. To understand the
performance difference we examine the different steps of the FT benchmark in detail. In both
implementations, the 3-D FFT is accomplished by performing a 1-D FFT in each of the three spatial
dimensions. For each spatial dimension the three-dimensional array is copied into a one-dimensional
array, the FFT is performed on the one-dimensional array, and the result is copied back. A code
fragment for the first dimension is shown in fig. 6.

Figure 5: Comparison of MPI and OpenMP/DSM speedups for classes A, W and S of the FT
benchmark. (a) Speedup for OpenMP/DSM, (b) MPI speedup, (c) speedup on the SMP system

      do k = 1, n3
         do j = 1, n2
            do i = 1, n1
               w(i) = u(i,j,k)
            enddo
            call fft(w, ...)
            do i = 1, d(1)
               u(i,j,k) = w(i)
            enddo
         enddo
      enddo

Figure 6: Code fragment for the first dimension of FFT

The OpenMP parallelization is achieved by inserting an OMP PARALLEL DO on the outermost
loop. This results in a distribution of the data in the K dimension, corresponding to the z-direction.
The speedup for the three individual spatial dimensions for the OpenMP implementation on the class
A benchmark is shown in fig. 7. While the FFTs in the x and y dimensions reach a speedup of 6 out of
8, the speedup in the z dimension is only 2 out of 8. The performance loss in the x and y dimensions is
mostly due to communication caused by writing to the shared array U, which is indicated by page
requests within this loop. Logically there is no communication required for this loop, since only the
local part of the array is accessed. The performance decrease for the z dimension is due to the fact
that here the outermost loop of the loop nest from fig. 6 runs in the J and not in the K dimension. Since
the data was distributed in the K dimension, parallel execution of the loop requires access to remote data
and causes a large number of page requests. The MPI implementation performs a transpose of the
three-dimensional array in the z dimension, which is achieved by a call to MPI_ALLTOALL. This causes
some decrease in performance, but not as severe as in the DSM system.
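The difference between the two loop orders can be quantified by counting which parts of the array each worker touches. The sketch below is a Python illustration under simplifying assumptions: the array is distributed in blocks along K, ownership is tracked per line u(:, j, k) rather than per memory page, and the worker count divides the array extents. The function names `owner` and `remote_fraction` are hypothetical.

```python
def owner(index, extent, nworkers):
    """Worker that holds the given index under a block distribution."""
    return index // (extent // nworkers)     # assumes nworkers divides extent

def remote_fraction(n2, n3, nworkers, outer):
    """Fraction of lines u(:, j, k) that workers read from other workers.

    The data is distributed in blocks along K. The parallel outer loop runs
    either over k (x and y dimensions) or over j (z dimension).
    """
    local = remote = 0
    for w in range(nworkers):
        if outer == "k":
            mine = [(j, k) for k in range(n3) if owner(k, n3, nworkers) == w
                    for j in range(n2)]
        else:
            mine = [(j, k) for j in range(n2) if owner(j, n2, nworkers) == w
                    for k in range(n3)]
        for j, k in mine:
            if owner(k, n3, nworkers) == w:  # data placement follows K
                local += 1
            else:
                remote += 1
    return remote / (local + remote)

# Class A grid (ny = 256, nz = 128) on 8 workers.
print(remote_fraction(256, 128, 8, outer="k"))   # 0.0
print(remote_fraction(256, 128, 8, outer="j"))   # 0.875
```

With the outer loop over J, each of the 8 workers finds only one eighth of the lines it needs in its own part of the array, which is consistent with the large number of page requests observed for the z dimension.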



Figure 7: Speedup for different directions of the FFT on the DSM system 


5 Problems encountered 

The installation of SCore, SCASH and Omni OpenMP was rather straightforward. For the basic
SCore installation we tried to use aggressive compiler optimizations whenever possible and we went
through an iterative process to find a stable configuration in terms of compiler settings. The SCASH
and Omni OpenMP configurations were based on the one found for the basic SCore system. We were
able to run all tests and examples delivered with either SCASH or the Omni OpenMP compiler suite
successfully.

We ran into problems when trying to run the three kernel benchmarks EP, CG, and FT for larger
problem sizes such as those given in class B or C. We also could not run any of the simulated CFD
applications BT, SP, and LU that are part of the benchmark suite, even for the small problem size given
in class A. The problems we encountered were due to the fact that SCASH was not able to allocate
enough virtual memory. The SCASH system itself uses a large amount of memory for its own
memory management on top of the one provided by the operating system. To improve data exchange
performance (i.e. bandwidth and latency) SCASH specifically allocates pin-down memory [21]. For
larger benchmark classes it seems that there is not enough pinnable memory available.

Another severe restriction is the 32 bit address-space of the IA32 architecture. With 32 bit
addresses the address-space is restricted to at most 2^32 addresses. Usually the memory management of
operating systems like Linux or Windows¹ allows a process to use only part of this address-space
for its private data. The operating system uses the rest to mirror some internal data structures into the
process' virtual address-space. Under Linux a process can only use 2 GB of the theoretical maximum
of 4 GB for its private data.

Without additional effort, the kernel itself would suffer from this 4 GB barrier. To enable the use
of more main memory, on IA32 Linux uses the PAE capabilities of modern processors to access up to
64 GB. This is achieved by having a three stage page address translation mechanism. But even with
this system, only the kernel can handle more than 4 GB. A single process is still restricted to 2 GB of
private memory.

A software distributed shared memory system like SCASH that runs in user-mode and uses a 32 
bit global address-space will therefore be restricted to a maximum of 4 GB global shared memory. 
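The address-space limits discussed above follow from simple arithmetic, shown here as a short Python sketch. The 36 bit physical address width of PAE is a property of the architecture; the helper `complex_array_bytes` is hypothetical and assumes 16 bytes per element, that is, one double precision complex number. It estimates the size of a single array on the FT class A grid and does not account for the number of such arrays the benchmark allocates or for the memory SCASH needs internally.

```python
GiB = 2 ** 30
MiB = 2 ** 20

addressable_32bit = 2 ** 32            # bytes reachable with 32 bit addresses
addressable_pae = 2 ** 36              # PAE extends physical addresses to 36 bits
print(addressable_32bit // GiB)        # 4
print(addressable_pae // GiB)          # 64

def complex_array_bytes(nx, ny, nz, bytes_per_element=16):
    """Size in bytes of one nx*ny*nz array of double precision complex numbers."""
    return nx * ny * nz * bytes_per_element

print(complex_array_bytes(256, 256, 128) // MiB)   # 128
```

A handful of arrays of this size plus the runtime's own buffers quickly approaches the per-process limit, which illustrates why the larger benchmark classes are problematic on a 32 bit platform.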

6 Related Work 

Another system supporting the OpenMP paradigm on distributed memory systems is TreadMarks [2].
Comparisons of the TreadMarks system with message passing programming are given in [7] and
[11]. Other systems that support software DSM programming are Cashmere [17] and SMP-Shasta [?].
There are a number of papers reporting on comparisons of different programming paradigms. As
an example we name [15] and [16], where message passing and shared memory programming are
compared on shared memory architectures.


7 Conclusions and Future Work 

We have measured the performance of OpenMP/DSM implementations of three of the NAS Parallel
Benchmarks on a commodity cluster of PCs, and we compared the speedup to corresponding MPI
implementations of the same algorithms. The difference in performance depends on the structure of
the application and the problem size. For the largest problem sizes under consideration the observed
OpenMP/DSM speedups range between 100% and 70% of the MPI speedup for all benchmarks.
Only in cases with an extremely high communication to computation ratio does the OpenMP/DSM
speedup go down to less than 10% of MPI. This occurs in the smallest class of the CG benchmark,
where AXPY and dot-product operations for short vector lengths are being parallelized. We have
noticed that in this extreme case part of the performance decrease was due to compiler deficiencies
which also show on a shared memory system. The memory problems described in section 5 are
implementation dependent and we expect them to be resolved in commercial software. Usage of 64
bit system software and kernel enhancements to support DSM on a system level will improve the
general usability of DSM systems.

All in all we are encouraged by the results we obtained considering the fact that we were using
public domain software. The DSM system allowed us to exploit parallelism over all nodes of
the cluster by using automatically parallelized code based on OpenMP. We find the performance
differences when compared with hand-optimized MPI code acceptable when we take into account the
extremely short development time of the parallel code. Our future plan is to run full size applications
in our testbed environment.

1 Windows is a registered trademark of Microsoft Corp.

Distributed vs Shared Memory Parallelism

• Distributed Memory

► Commodity hardware

► Commodity network

► Low cost alternative for high end scientific computing

► Currently difficult to program:

• Data distribution

• Message Passing required

• Shared Memory

► Globally shared address space

► Parallelization via compiler directives

► Incremental parallelization possible

► High cost of hardware

► Limited scalability

Distributed Shared Memory (DSM)

• Software for distributed memory architecture

• Enables shared memory programming

• Combines:

► Ease of programming (shared memory)

• Examples are:

► TreadMarks

► SCASH

• Performance?


Test Environment: Hardware 


• PC Cluster at HLRS 

• 8 NEC 120Ed server nodes. 

• Each Node: 

► Dual processor 

► 1 GHz Pentium III 

► 2 GB main memory 

• Network: 

► Myrinet 2000 NIC 

► 64 bit/66 MHz PCI slot 

• Bandwidth: 

► 409 MB/s read 

► 480 MB/s write

The FT Benchmark

Future Work:

• Run full applications under DSM

• Try commercial DSM software once it becomes available (i.e. KAI/Pro Toolset Network Edition)